Abstract.

Understanding the inverse equivalent width - luminosity relationship 
(Baldwin Effect), the topic of this meeting, requires extracting informa- 
tion on continuum and emission line parameters from samples of AGN. 
We wish to discover whether, and how, different subsets of measured 
parameters may correlate with each other. This general problem is the 
domain of Principal Components Analysis (PCA). We discuss the purpose,
principles, and interpretation of PCA, using some examples
from QSO spectroscopy. The hope is that identification of relationships 
among subsets of correlated variables may lead to new physical insight. 



1. Introduction 

The point of all statistics is simplification. The amount of data the world can 
throw at us would swamp Einstein: we have to simplify to survive. Statistics is 
the art of extracting simple comprehensible facts that tell us what we want to 
know for practical reasons, from the floods of data washing over us. 

Consider the fuel consumption of cars, for example. Every car will be 
different, depending on its model, year, maintenance state, and the aggression 
level of its driver. To fully characterize the fuel economy of cars in the USA would 
require a different number for every car/driver combination: that is, more than 
$10^8$ numbers. For most purposes, however, such as working out the nation's
likely oil usage, these $10^8$ numbers can be replaced with one: the average fuel
consumption. An enormous simplification! 

Principal Components Analysis (PCA) is a tool for simplifying one partic- 
ular class of data. Imagine that you have n objects (where n is large), and you 
can measure p parameters for each of them (where p is also large). For example, 
the objects could be the n QSO researchers attending a meeting in La Serena, 
and the parameters could be the p things you know about each of them: e.g., 



1 This paper is for our friends Leah Cutter and Mike Brotherton, who announced their engage- 
ment during the La Serena meeting. 
their heights, weights, number of publications, frequent flier miles and the fuel
consumption of their cars. Let us imagine that you want to investigate how 
these p parameters are related to each other. For example, do astronomers who 
spend most of their lives in airports publish more? Does this depend on how fat 
they are? Do people with inefficient cars fly more, or is it only the smart ones 
(with lots of publications) that do so? Do these correlations represent real causal 
connections, or is it just that once you get tenure you buy a new car, become
fat, stop publishing and give lots of invited talks in exotic foreign locations? 

The traditional way of dealing with this type of problem is to plot everything 
against everything else and look for correlations. Unfortunately, as the number 
of parameters increases, this rapidly becomes impossibly complicated. It is easy 
to get lost in the web of parameters, each of which correlates more or less well 
with some combination of the other parameters. The human brain can cope with 
two or three parameters easily. By plotting all the different variables against 
each other separately, we can just about learn something about 5-7 variables. 
But once we are beyond this, the human brain needs help. 

PCA is specifically designed to help in situations like this: when you know 
lots of things about lots of objects, and want to see how all these properties are 
inter-related. Basically, PCA looks for sets of parameters that always correlate 
together. By grouping these into one new parameter, an enormous saving in 
complexity can be achieved with minimal loss of information. 

PCA is one of a family of algorithms (known as multivariate statistics) 
designed to handle complex problems of this sort. It was first widely applied 
in the social sciences. The most infamous early application of PCA was to 
intelligence testing. You can test the intellectual ability of people in many 
ways. For example, you could give a sample of n people a set of p exams, with 
questions testing their creativity, memory, math skills, verbal skills etc. Do 
people who score well on one test score well on all? Or do the scores break 
up into sub-groups, such as verbal or logical scores, which correlate well with 
the scores on other similar tests? PCA was applied to these exercises, and it 
was found that nearly all the scores correlate well with each other. Thus, it was 
claimed, a single underlying variable (known as IQ) can be used to replace all the 
individual scores, and once you know someone's IQ, you can accurately predict 
their performance on all the tests. (See Stephen Jay Gould's 'The Mismeasure of
Man' for a hilarious account of the misuse of this application of PCA.) 

2. Overview 

The task of PCA is then, given a sample of n objects with p measured quantities
for each, i.e. p variables, $x_j$ ($j = 1, \ldots, p$), find a set of p new, orthogonal (i.e.
uncorrelated) variables, $\xi_1, \ldots, \xi_j, \ldots, \xi_p$, each one a linear combination of the
original variables, $x_j$:

$$\xi_j = a_{j1} x_1 + a_{j2} x_2 + \cdots + a_{jp} x_p .$$

Determine the constants $a_{ji}$ such that the smallest number of new variables
accounts for as much of the variance of the sample as possible. The $\xi_j$ are called
principal components.
If most of the variance in the original data can be accounted for by just a few
of the p new variables, we will have found a simpler description of the original 
dataset. A smaller number of variables may point to a way of classifying the 
data. More interesting, beyond the realm of statistical description, the PCA, by 
showing which original variables correlate together, may lead to new physical 
insight. Of course it will sometimes happen that the observed variables are 
uncorrelated, or at least, lead to no dominant principal components. That may 
be useful to know, but not very interesting. 

The concept of PCA is usually introduced either algebraically, through co- 
variance matrices, or geometrically. We will first give a geometrical overview, 
then illustrate with examples and interpretation. Many textbooks on multivari- 
ate statistics give rigorous mathematical treatments (e.g., Kendall 1980). 
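
For orientation, the algebraic route can be summarized in a few lines. This is a standard textbook sketch rather than anything taken from the references above; the symbols $\mathbf{x}$, $R$, $\mathbf{a}_j$ and $\lambda_j$ are notation assumed here for illustration, and the variables are taken to be standardized (zero mean, unit variance).

```latex
% Assumed notation: x is the vector of p standardized variables,
% R is their p x p correlation matrix, a_j is the weight vector of component j.
\begin{aligned}
\xi_j &= \mathbf{a}_j^{\mathsf T}\mathbf{x}, \qquad \mathbf{a}_j^{\mathsf T}\mathbf{a}_j = 1,\\
\operatorname{Var}(\xi_j) &= \mathbf{a}_j^{\mathsf T} R\,\mathbf{a}_j,\\
R\,\mathbf{a}_j &= \lambda_j\,\mathbf{a}_j, \qquad
  \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,\\
\sum_{j=1}^{p} \lambda_j &= \operatorname{tr} R = p .
\end{aligned}
```

Maximizing the variance of each new variable subject to unit-length weights and orthogonality to the earlier components leads to the eigenvector equation in the third line, so the variance of each principal component is the corresponding eigenvalue.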

3. A Geometrical Approach to Principal Components Analysis 

Consider the case of p variables. The data of n QSOs are represented by a 
large cloud in p-dimensional space. If two or more parameters are correlated, 
the cloud will be elongated along some direction in hyper-space defined by their 
axes. Large extensions can arise when a few parameters are correlated, or when 
smaller correlated variations occur for a substantial number of variables. 

PCA identifies these extended directions and uses them as a set of axes 
for the parameterization of the multidimensional space. Following the analysis,
each QSO can be represented by its coordinates in the new space. The new axes 
are identified sequentially: PCA first finds the most extended direction in the 
original p-dimensional space by minimizing the sums of squares of the deviations 
from that direction. This direction forms the first principal component (often 
called eigenvector 1), and accounts for the largest single linear variation among 
measured QSO properties. Next we consider the (p - 1)-dimensional hyper-plane
orthogonal to the first principal component. We then search for the direction
that represents the greatest variance in (p - 1)-space, thus defining the second
principal component. This process is continued, defining a total of p orthogonal 
directions. 
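
The geometrical picture can be tried out numerically. The sketch below is a minimal illustration on a synthetic cloud of points, not on real QSO data. Instead of the sequential search described above, it uses an eigen-decomposition of the covariance matrix, which yields the same set of orthogonal axes; the function name `principal_axes` and all numbers are assumptions made for the example.

```python
import numpy as np

def principal_axes(data):
    """Return (variances, axes) for `data` with shape (n objects, p variables).

    Columns of `axes` are unit vectors, ordered by decreasing variance.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # p x p covariance matrix
    variances, axes = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(variances)[::-1]         # re-order: largest variance first
    return variances[order], axes[:, order]

# Synthetic cloud: 200 points in 3 dimensions, elongated along one direction.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
cloud = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.3 * rng.normal(size=(200, 3))

variances, axes = principal_axes(cloud)
coords = (cloud - cloud.mean(axis=0)) @ axes    # coordinates of each point in the new axes
print(variances)
```

Each row of `coords` gives the position of one object in the rotated coordinate system, and the variance of each column equals the corresponding entry of `variances`.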

4. Examples using Real Data 
4.1. PCA with Two Variables 

Consider the case of 22 QSOs each with measured values of X-ray spectral
index $\alpha_x$ (defined by $F_\nu \propto \nu^{-\alpha_x}$ between 0.15 keV and 2 keV), and FWHM
Hβ (full width at half maximum for the broad Hβ emission line). The data
points are distributed in an elongated cloud in 2 dimensions, as shown in Fig.
1. It is standard practice to subtract the mean value from each variable, and
normalize by dividing by the standard deviation. One can find the direction of
the first principal component axis by rotating an axis to align with the direction
of maximum elongation, actually maximum variance, of the data. The result
of this is shown by the dashed line labeled PC1 in Fig. 2. Because the points
remain the same distance from the origin, by Pythagoras' theorem, maximizing
the variance along PC1 is equivalent to minimizing the sums of squares of the
Figure 1. An important optical correlation, soft X-ray spectral index
vs. width of the broad Hβ emission line. Left: In natural units.
Right: In normalized units, with mean subtracted, then divided by the
standard deviation. The dashed line shows the direction of the first
principal component (PC1), representing the maximum deviation of
the cloud of data points. Dotted lines project the data points onto this
direction. PC1 represents the direction that minimizes the sums of the
squares of the lengths of the dotted lines. The value (score) of PC1
for a given point is the distance of the point from the origin, projected
onto PC1. Similarly the lengths of the dotted lines represent the values
of PC2 for each data point.



distances of the points from this line through the origin; these distances are 
shown as dotted lines. The distance of a point from the origin, projected onto 
the direction PC1 represents the value (score) of the first principal component
for that data point. Clearly, PC1 is a linear combination of the original input
variables. The variance of PC1 is 1.764 - rather more than the unit variance
of the original variables. The total variance of the sample is the sum of the 
variances for each variable, in this case, 2. Because the new coordinates are 
found simply by rotation, the distance of the points from the origin remains 
unchanged, so the total variance remains the same. Thus the first principal 
component accounts for 1.764/2 = 88.2% of the total sample variance. The 
remaining variance of the sample can be accounted for by the projection of the 
data points onto the axis PC2, perpendicular to PC1 - or 0.236/2 = 11.8%.
These projections (lengths of the dotted lines) are the values or scores of the 
second principal component. 
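
The bookkeeping in this two-variable case can be checked with a short sketch. It uses synthetic numbers as a stand-in for the 22 measured pairs (the published data are not reproduced here), and the helper name `standardize` is an assumption. For two standardized variables with Pearson correlation coefficient r, the two principal-component variances are 1 + |r| and 1 - |r|, so they always sum to 2; under that assumption a first-component variance of 1.764 would correspond to |r| of about 0.764.

```python
import numpy as np

# Synthetic stand-in for 22 pairs of measurements; not the published data.
rng = np.random.default_rng(1)
x = rng.normal(size=22)
y = -0.8 * x + 0.6 * rng.normal(size=22)

def standardize(v):
    """Subtract the mean, then divide by the sample standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

z = np.column_stack([standardize(x), standardize(y)])
r = np.corrcoef(z, rowvar=False)[0, 1]

# Variances along PC1 and PC2: eigenvalues of the 2 x 2 correlation matrix.
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(z, rowvar=False)))[::-1]

print(eigenvalues, 1 + abs(r), 1 - abs(r))
print(eigenvalues / eigenvalues.sum())      # fraction of total variance per component
```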

We have succeeded in defining a new variable, a linear combination of $\alpha_x$
and log FWHM Hβ, that accounts for most of the variation within the sample
(PC1). The interpretation of this parameter is a hotly contended topic (e.g.,
Pounds, Done & Osborne 1995, Laor et al. 1997, Brandt & Boller 1998). Is
PC2 of any significance? The astronomer, with knowledge of the measurement 
uncertainties, may have more hope of addressing this. If the original variables 
had been uncorrelated, we could still define PC1 and PC2 mathematically, but
we would be no better off as a result of the analysis. 

4.2. PCA with More Variables 



Table 1. Input Data 
Number       22       21       22       22       22       21
Mean         45.56    1.57     3.498    0.56     0.99     3.446
Std dev'n    0.47     0.38     0.212    0.40     0.47     0.129

a Log of continuum luminosity at 1216 Å in units of erg s$^{-1}$ ($H_0$ = 50 km s$^{-1}$ Mpc$^{-1}$, $q_0$ = 0.5).
FWHM are in km s$^{-1}$; rest-frame equivalent widths (EW) are in Å.

PCA achieves its real usefulness in multivariate problems. We perform a 
PCA 1 on the small sample of 22 QSOs discussed by Wills et al. (1998a,b), 
using a subset of the available measured properties shown in Tables 1 and 2. 
Unavoidably, there are missing data, so the number of objects available depends 



Several widely available statistical packages include a task for Principal Components Analyses 
(Statistical Package for the Social Sciences - SPSS, Statistical Analysis System - SAS, Minitab
- Minitab Reference Manual 1992). 
on the variables chosen for the PCA. Many correlations are plotted, and the
Pearson correlation coefficients tabulated, by Wills et al. (this volume). 



Table 2. Input Data, continued 
Number       20       22       20       21       21       20       22
Mean         2.11     1.78     0.421    1.279    0.362    0.212    0.108
Std dev'n    0.21     0.24     0.131    0.194    0.199    0.091    0.043



Tables 1 and 2 also present, for each variable, the number of data points, 
the mean and the standard deviation. Notice the completely different units 
for different measured parameters. Clearly, from our two dimensional example, 
one can see in Fig. 1 or 2 that the deviations from PC1, hence PC1 itself,
will depend on the units chosen (the weighting of the variables). In order to
weight the variables more or less equally, after subtracting the mean values, we
normalize by the standard deviation. The choice of weights is a difficult issue, and depends
on the user's knowledge of the data, and preferences, as well as the use to which
the results will be put. The results of performing a PCA on these normalized
variables are shown in Table 3. Columns (2)-(6) show the first 5 out of a total of
13 principal components. The first row gives the variances (eigenvalues) of the 
data along the direction of the corresponding principal component. The sums 
of all the variances add up to the sums of the variances of the input variables, 
in this case, 13. By convention, the principal components are given in order
of their contribution to the total variance. This is given as 'Proportion' in the 
second line, and the 'Cumulative' proportion on the third line. Thus, among 
the parameters we have chosen to use, the first principal component contributes 
50% of the spectrum-to-spectrum variance, the second 22%, the third, 12%. The 
first two principal components together contribute 71% of the variance, the first 
3, 84%, and the first 4, nearly 90%. 
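
The 'Proportion' and 'Cumulative' rows follow directly from the eigenvalues. As a quick check, the sketch below divides the five eigenvalues listed in Table 3 by the total variance of 13; the variable names are chosen here for illustration.

```python
eigenvalues = [6.4505, 2.8157, 1.5879, 0.6257, 0.5698]   # first five, from Table 3
total_variance = 13                                       # 13 standardized variables

# Proportion of the total variance carried by each component.
proportions = [ev / total_variance for ev in eigenvalues]

# Running sum of the proportions.
cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

print([round(v, 3) for v in proportions])   # [0.496, 0.217, 0.122, 0.048, 0.044]
print([round(v, 3) for v in cumulative])    # [0.496, 0.713, 0.835, 0.883, 0.927]
```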

Table 3. Results of Eigenanalysis - The Principal Components^a

               PC1      PC2      PC3      PC4      PC5
Eigenvalue     6.4505   2.8157   1.5879   0.6257   0.5698
Proportion     0.496    0.217    0.122    0.048    0.044
Cumulative     0.496    0.713    0.835    0.883    0.927

Variable       PC1      PC2      PC3      PC4      PC5
log L1216      0.053    0.535   -0.123   -0.029   -0.405
$\alpha_x$     0.295   -0.198    0.079    0.485   -0.455
FWHM Hβ       -0.330    0.077   -0.357   -0.082   -0.441
… 0.003 -0.487 -0.… … -0.311 … …116
…              0.231   -0.050   -0.573    0.107   -0.288
λ1400/Lyα      0.223   -0.351   -0.225    0.441   -0.216



a 18 of 22 QSO spectra used; 4 cases contain missing values. 

The columns of numbers for each principal component represent the weights 
assigned to each input variable. Thus PC1 = $0.053\,x_1 + 0.295\,x_2 - 0.330\,x_3 + \ldots$,
where $x_1$, $x_2$, and $x_3$ are the values of the normalized variables corresponding
to log L1216, $\alpha_x$, FWHM Hβ, etc. By convention these weights are chosen so
that the sum of their squares = 1. This arbitrarily fixes the scale of the new
variable. The sign of the new variable remains arbitrary.
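
The role of the weights can be illustrated with a small sketch. The data are synthetic, and names such as `w1` and `scores` are assumptions made for the example. It shows that the weight vector returned by a standard eigen-solver has unit length, that the score of each object is the weighted sum of its normalized variables, and that flipping the sign of all weights leaves the variance unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(50, 4))          # synthetic: 50 objects, 4 variables
data[:, 1] += data[:, 0]                 # induce one correlation

# Normalize: subtract the mean, divide by the sample standard deviation.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

eigenvalues, weights = np.linalg.eigh(np.corrcoef(z, rowvar=False))
w1 = weights[:, -1]                      # eigh sorts ascending, so the last column is PC1

scores = z @ w1                          # PC1 score of every object: sum of weight * variable
flipped = z @ (-w1)                      # same component with the opposite sign convention

print(np.sum(w1 ** 2))                                # sum of squared weights
print(np.var(scores, ddof=1), eigenvalues[-1])        # variance of the scores vs eigenvalue
```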

4.3. Interpretation 

The first principal component is elongated with variance 6.5 times that of any
individual measurement, and accounts for about half the total variance. This
is therefore likely to be highly significant. If all measured, normalized quantities
contributed equally to PC1, they would all have weight 0.277 ($1/\sqrt{13}$ for 13
variables), but each variable contributes more or less than this. One way to test
the significance of the contribution of any one measured variable is to perform
the PCA without that variable, then check the significance of the correlation
between that variable and the scores of the new principal component. This
procedure shows that all measured variables except L1216, log FWHM CIII],
and log EW Lyα correlate with PC1, but correlations involving NV/Lyα and
λ1400/Lyα are not very strong. PC2, accounting for 22% of the variance in this
dataset, appears to link the EW Lyα, EW CIV, and EW CIII] with L1216, so
EW CIV and EW CIII] appear to contribute to both PC1 and PC2, but EW
Lyα contributes predominantly to PC2. Is PC2 a significant component? A
similar correlation test shows that individually the EWs do anti-correlate with
L1216, but this result depends on the lowest EWs for the highest luminosity QSO
PG1226+023 and the highest EWs for the low luminosity QSO PG1202+281.
However L1216 correlates significantly (Pearson's ordinary correlation coefficient
= -0.77) with PC2 formed when L1216 is excluded. Thus there is a significant
overall correlation between EW and L1216, although a larger sample is clearly
needed to investigate the individual EW correlations. Another test may be to
check correlations between observed measurements for those measurements that
contribute to only one significant principal component - for example, CIV/Lyα
vs. FeII/Hβ (see Fig. and Table of Wills et al. in this volume).
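
The leave-one-out test described above can be sketched as follows. This is a hedged illustration on synthetic data, not a reproduction of the analysis in this paper; the function names `first_pc_scores` and `leave_one_out_correlation` are hypothetical, and judging whether a given coefficient is significant for a given sample size requires a separate standard test that is not shown.

```python
import numpy as np

def first_pc_scores(z):
    """Scores of the first principal component of standardized data z."""
    _, vectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    return z @ vectors[:, -1]            # last column belongs to the largest eigenvalue

def leave_one_out_correlation(z, k):
    """Pearson r between variable k and PC1 computed without variable k."""
    others = np.delete(z, k, axis=1)
    return np.corrcoef(z[:, k], first_pc_scores(others))[0, 1]

# Synthetic example: variables 0-2 share a common factor, variable 3 is pure noise.
rng = np.random.default_rng(3)
factor = rng.normal(size=(300, 1))
data = np.hstack([factor + 0.5 * rng.normal(size=(300, 3)),
                  rng.normal(size=(300, 1))])
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

for k in range(z.shape[1]):
    # The sign of a principal component is arbitrary, so report |r|.
    print(k, round(abs(leave_one_out_correlation(z, k)), 2))
```

In this toy case the three variables that share the common factor correlate strongly with the component formed from the others, while the noise variable does not.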

As a rule-of-thumb, any principal component with variance greater than
1 should be considered seriously. It is also worth investigating any principal
component with variance rather greater than that of the remaining principal 
components. In our example, this could mean the first three principal compo- 
nents. 
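
Applied to the eigenvalues listed in Table 3, the rule-of-thumb can be checked in a few lines. Only five of the 13 eigenvalues are tabulated, but because all 13 sum to 13, the variance left for the eight unlisted components can be inferred; the variable names are illustrative.

```python
eigenvalues = [6.4505, 2.8157, 1.5879, 0.6257, 0.5698]   # first five, from Table 3
n_variables = 13

# The eigenvalues sum to 13, so the eight unlisted ones share what is left.
remaining = n_variables - sum(eigenvalues)

# Components that pass the "variance greater than 1" rule.
retained = [ev for ev in eigenvalues if ev > 1.0]

print(round(remaining, 4))   # 0.9504: less than 1, so no unlisted component can pass
print(len(retained))         # 3
```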

PCA is a linear analysis. Tests should be performed to check on the lin- 
earity of the principal components. If a linear analysis is valid, plotting the 
scores of PC1 vs. PC2 should show a normal distribution consistent with no
correlation between the two. Mathematically, there cannot be a correlation, but 
a non-random distribution of points, or individual outlying points, may indicate 
non-linearity of the relationships - or some other problem with the uniformity 
of the data-set. Outliers could be rejected and the analysis repeated, or a trans- 
form of co-ordinates, for example to logarithmic co-ordinates, may reduce the 
problem to a linear analysis. A PCA performed using the ranks rather than 
the actual (normalized) measurements may be more robust to both non-random 
distributions and outliers. (Compare the present results with those from the 
analysis of the ranks, in Table 2 of the other PCA paper in this volume.) These 
tests are an important tool for examining non-linearities in the data, and for 
discovering individual unusual objects. 
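
Two of the points above lend themselves to a short sketch: the scores of different principal components are uncorrelated by construction, and a PCA can be run on ranks instead of the measurements themselves. The data below are synthetic, with a deliberately non-linear relation and one gross outlier, and the helper names are assumptions for illustration; the simple ranking used here does not handle tied values.

```python
import numpy as np

def standardize(a):
    """Subtract column means, divide by column sample standard deviations."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def pca_scores(z):
    """Scores on all principal components, ordered by decreasing variance."""
    _, vectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    return z @ vectors[:, ::-1]

def to_ranks(a):
    """Replace each column by the ranks of its entries (ties not handled)."""
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float) + 1.0

# Synthetic data: a non-linear relation between two variables, plus an outlier.
rng = np.random.default_rng(4)
x = rng.uniform(0.1, 3.0, size=40)
data = np.column_stack([x, np.exp(x) + 0.2 * rng.normal(size=40), rng.normal(size=40)])
data[0] = [3.0, 200.0, 0.0]

scores = pca_scores(standardize(data))
rank_scores = pca_scores(standardize(to_ranks(data)))

# Zero to within rounding error, even though the relation is non-linear.
print(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])
```

Plotting `scores[:, 0]` against `scores[:, 1]`, and the same for `rank_scores`, is one way to look for the curved patterns or isolated points mentioned above.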

5. Some Examples from the Literature 

Increasing awareness of statistical methods has led to the establishment of the 
Statistical Consulting Center For Astronomy at Penn State University (Akritas 
et al. 1997, Feigelson et al. 1995, see also http://www.stat.psu.edu/scca/ and 
www.astro.psu.edu/statcodes), and a series of conference and other volumes 
devoted to statistics in astronomy (Murtagh & Heck 1987; Feigelson & Babu
1992), including PCA.

PCA is being increasingly applied in astrophysics. Investigations of low and 
high redshift galaxies depend on their classification (by morphology, photometry,
kinematics, etc.), in terms of the purely observational "fundamental plane",
a subspace of the p-dimensional parameter space (Djorgovski & Davis 1987, see
also an interesting PCA paper by M. Han 1995). The same area of astronomy
has also extensively applied 'neural network' techniques for the automated
classification of galaxy images, and more (Odewahn 1998; Rawson, Bailey & Francis
1996).

An example similar to that presented here, using a subset of the param- 
eters we consider, but for a much larger sample, is provided in the paper by 
Boroson & Green (1992). Other examples of PCA are given and discussed
by Whitney (1983a, b), and Murtagh and Heck (1987). Some PCAs
have a larger number of variables than objects, p > n. This results
in a singular matrix and therefore requires modifications to the techniques to
solve the eigenvector equations. These techniques are discussed, for example,
by Wilkinson (1978), and mentioned by Mittaz, Penston, & Snijders (1990).
This situation occurs in 'Spectral PCA'. The principles are identical, but the
number of variables is larger than the number of QSO spectra. Here the QSO
spectra are divided into many discrete bins, by wavelength or log (wavelength)
(or velocity), and the p variables are the fluxes in these p bins. An excellent
example and discussion of interpretation is given by Francis et al. (1992). For
another example, see Wills, Brotherton, Wills & Thompson (1997). Spectral
PCA also finds application to spectral time variability. For example, Mittaz et
al. (1990) analyze the spectra of NGC 4151 at 59 epochs, binning each spectrum
in wavelength space (1375 bins). A more recent example is given by Türler &
Courvoisier (1998).
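
One common way to handle the p > n case is a singular value decomposition of the mean-subtracted data matrix, which avoids forming the singular p x p matrix altogether. The sketch below illustrates this on toy "spectra"; it is an assumption-laden example, not necessarily the technique used in the papers cited above, and the spectra, bin count and variable names are all invented for the illustration. The fluxes are left in their common units rather than divided by a per-bin standard deviation.

```python
import numpy as np

# Synthetic "spectra": n = 20 objects, p = 500 wavelength bins, so p > n.
rng = np.random.default_rng(5)
n_spectra, n_bins = 20, 500
wavelength = np.linspace(0.0, 1.0, n_bins)
line = np.exp(-0.5 * ((wavelength - 0.5) / 0.02) ** 2)     # a toy emission line
strength = rng.normal(1.0, 0.3, size=(n_spectra, 1))        # differs from object to object
spectra = 1.0 + strength * line + 0.01 * rng.normal(size=(n_spectra, n_bins))

residual = spectra - spectra.mean(axis=0)                   # subtract the mean spectrum

# Thin SVD of the n x p matrix; the p x p covariance matrix is never formed.
u, s, vt = np.linalg.svd(residual, full_matrices=False)

variances = s ** 2 / (n_spectra - 1)     # variance along each principal component
eigenspectra = vt                        # each row is one principal-component spectrum
scores = u * s                           # coordinates of each object on the components

print(variances[:3] / variances.sum())   # leading proportions of the total variance
```

At most n - 1 components can carry variance once the mean spectrum is removed, which is why the last entry of `variances` is zero to within rounding error.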

Recommended for further reading is Chapter 6 of Manly's 'Multivariate
Statistical Methods' (1994), which gives a good brief discussion of the method,
with useful insights into interpretation. A more rigorous mathematical treatment,
together with discussion, is given by the great researcher and expositor of
statistics M. Kendall (Chapters 1 and 2 of 'Multivariate Analysis').